{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第13章 实例分割"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.1 简介"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实例分割（Instance Segmentation）的目的是从图像中分割出每个目标实例的掩膜。与语义分割相比，实例分割不但要区分不同的类别，还需要区分不同的目标实例。如图13-1所示，语义分割的结果中，不同的羊对应的标签是一样的，而实例分割的结果中，不同的羊的类别标签一样，但是会有不同的实例号（id）；与目标检测相比，如图13-1所示，实例分割需要在每个边界框内再进一步将目标的掩膜分割出来。所以，实例分割相当于是语义分割和目标检测两个任务的融合。用数学语言来描述实例分割的问题即是：定义图像空间$\\mathbb{I}$和类别集合$\\mathcal{C}$，给定数据集$\\mathcal{D} = \\{(\\mathbf{I}^{(n)},\\mathbf{Y}^{(n)})\\}_{n=1}^N$，其中$\\mathbf{I}^{(n)}\\in \\mathbb{I}$是数据集中的第$n$张图像，$\\mathbf{Y}^{(n)}\\in\\{\\mathcal{C}\\times\\mathbb{N}\\}^{H^{(n)}\\times W^{(n)}}$是其对应的实例分割标签图，$H^{(n)}\\times W^{(n)}$是图像$\\mathbf{I}^{(n)}$的大小，$\\mathbb{N}$是自然数集，实例分割标签图中第$i$个条目$\\mathbf{y}_i^{(n)}=(y_i,z_i)$是图像中第$i$个像素对应的标签，$y_i\\in\\mathcal{C}$是该像素的类别标签，$z_i\\in\\mathbb{N}$是该像素从属实例的实例号；实例分割的任务是从$\\mathcal{D}$中学习得到一个从图像空间到实例分割标签图空间的映射$f:\\mathcal{I} \\rightarrow \\mathcal{Y}$，$\\forall n, \\mathbf{Y}^{(n)}\\in\\mathcal{Y}$，从而给定任意一张测试图像$\\mathbf{I}$，我们可以用学习得到的映射函数$f$预测该图像的实例分割标签图：$\\hat{\\mathbf{Y}}=f(\\mathbf{I})$。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic4.zhimg.com/80/v2-f021bd0866f443ef25ea16020570044b_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图13-1 实例分割与语义分割的关系。</div>\n",
    "</center>\n",
    "\n",
    "在这一章中，我们将一起学习基于深度学习的实例分割方法，并动手实现相应的实例分割框架。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.2 数据集与评测指标"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与目标检测类似，MS COCO也是最常用的实例分割数据集，mAP也是实例分割的评测指标。只是这里的mAP是通过Mask IoU进行度量。同目标检测中的Box IoU类似，对于图像中的一个目标，令其真实的掩膜内部的像素集合为$\\mathcal{A}$，一个预测的掩膜内部的像素集合为$\\mathcal{B}$，Mask IoU通过这两个掩膜内像素集合的IoU来度量两个掩膜的重合度：$\\texttt{IoU}(\\mathcal{A},\\mathcal{}B)=\\frac{\\mathcal{A} \\cap \\mathcal{B}}{\\mathcal{A} \\cup \\mathcal{B}}$，即图中绿色部分的面积除以红色的面积。同样通过调节Mask IoU的阈值，可以得到实例分割结果的PR曲线，PR曲线下方的面积即为实例分割结果的mAP。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic3.zhimg.com/80/v2-a872e31df9f7497d7d547e1dd83717e2_1440w.webp\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图13-2 两个掩膜的重合度通过Mask IoU度量。</div>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.3 Mask R-CNN\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "既然实例分割融合了目标检测和语义分割，那么实现实例分割的一个很直接的思路就是先目标检测，然后在每个目标检测框里做语义分割。著名的Mask R-CNN [1]模型就是采用的这一思路。从这个模型的命名可以看出，这是一个R-CNN系列工作的延续。Mask R-CNN也是由何凯明、Ross Girshick等人提出。Mask R-CNN获得了2017年国际计算机视觉大会（International Conference on Computer Vision，ICCV）最佳论文奖（Marr奖）。\n",
    "\n",
    "Mask R-CNN的结构如图 13-3 所示。不难发现，Mask R-CNN是Faster R-CNN得拓展，它们之间的主要区别有三点：\n",
    "\n",
    "1. Mask R-CNN引入特征金字塔网络（Feature Pyramid Network，FPN）[2] 作为主干网络；\n",
    "2. Mask R-CNN提出RoI Align取代Faster R-CNN中的RoI Pooling作为特征提取器；\n",
    "3. Mask R-CNN增加了一个用于预测掩膜（mask）的分支。\n",
    "\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic4.zhimg.com/80/v2-38ff78aa37a735b889c5201151340eef_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图13-3 Mask R-CNN结构框图。</div>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 13.3.1 特征金字塔网络"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于图像中目标存在多种尺度，为了能够检测到不同尺度的目标，目标检测模型通常需要多尺度检测架构。图 13-4展示了几种不同的用于目标检测的多尺度检测架构。\n",
    "(a)是早期的目标检测算法常用的图像金字塔结构，它通过将输入图像缩放到不同尺度从而构建图像金字塔，再从这些不同尺度的图像提取特征，得到每个尺度的特征图，最后分别在每个尺度的特征图上做目标检测。不难看出，使用图像金字塔最大的问题是推理速度慢了几倍，这是因为要推理的图像数多了几倍。(b)是Fast R-CNN、Faster R-CNN等目标检测模型采用的网络架构。考虑到多层的卷积神经网络蕴含了多尺度信息，这种架构只在主干网络的最后一层特征图上做目标检测。该架构最大的问题是对小尺寸的目标检测效果非常不理想，因为小尺寸目标的特征会随着逐层的降采样逐渐消失，因此最后一层已经有很少的特征支持小目标的精准检测。(c)从多层卷积网络的每一层都输出特征图，在每一层特征图上都进行目标检测。但这种架构只是单纯的从每一层导出预测结果，并没有进行层之间的特征交互，即没有给高层特征赋予浅层特征擅长定位小目标的能力，也没有给浅层的特征赋予高层蕴含到的语义信息，因此对目标检测效果的提升有限。\n",
    "\n",
    "(d)展示了FPN的架构，FPN的主要思想是构建一个特征金字塔，将不同层级的特征图融合到一起，从而实现多尺度目标检测。具体而言，FPN首先使用一个多层卷积网络得到每一层的特征图，越向上层的特征图尺度越小，这一过程也被称为自下而上。在此之后，FPN构建了一个自上而下的路径对不同尺度的特征图进行融合。首先FPN直接输出尺度最小的特征图（图(d)右边第3层，顶层）接着，对该特征图进行2倍的上采样，将其尺度大小和下一层级（图(d)左边第3层）的特征图的尺度保持一致，再通过卷积调整下一层级特征图的通道数，并将两者进行相加，得到融合的特征图（图(d)右边第2层）。这一过程被称为横向连接。在此之后，再对融合后的特征图2倍上采样，并在调整完成下一层级（图(d)左边第2层）的特征图通道数之后，将其进行融合。通过这种方式，FPN可以在不同尺度上获取特征，并将这些特征融合在一起，从而提高目标检测的准确性和效率。FPN已经被广泛应用于目标检测任务中，例如Faster R-CNN、Mask R-CNN等模型。\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic3.zhimg.com/80/v2-c60ec149033773cdc1698e1ad9f323b2_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图13-4 Mask R-CNN结构图。</div>\n",
    "</center>\n",
    "\n",
    "我们先学习FPN的代码实现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class FPN(nn.Module):\n",
    "    '''\n",
    "    FPN需要初始化一个list，代表ResNet每一个阶段的Bottleneck的数量\n",
    "    '''\n",
    " \n",
    "    def __init__(self, layers):\n",
    "        super(FPN, self).__init__()\n",
    "        # 构建C1\n",
    "        self.inplanes = 64\n",
    "        self.conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=3, bias=False)\n",
    "        self.bn1 = nn.BatchNorm2d(64)\n",
    "        self.relu = nn.ReLU(inplace=True)\n",
    "        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)\n",
    " \n",
    "        # 自下而上搭建C2、C3、C4、C5\n",
    "        self.layer1 = self._make_layer(64, layers[0])\n",
    "        self.layer2 = self._make_layer(128, layers[1], 2) # c2->c3第一个bottleneck的stride=2\n",
    "        self.layer3 = self._make_layer(256, layers[2], 2) # c3->c4第一个bottleneck的stride=2\n",
    "        self.layer4 = self._make_layer(512, layers[3], 2) # c4->c5第一个bottleneck的stride=2\n",
    " \n",
    "        # 对C5减少通道，得到P5\n",
    "        self.toplayer = nn.Conv2d(2048, 256, 1, 1, 0)  # 1*1卷积\n",
    " \n",
    "        # 横向连接，保证每一层通道数一致\n",
    "        self.latlayer1 = nn.Conv2d(1024, 256, 1, 1, 0)\n",
    "        self.latlayer2 = nn.Conv2d(512, 256, 1, 1, 0)\n",
    "        self.latlayer3 = nn.Conv2d(256, 256, 1, 1, 0)\n",
    " \n",
    "        # 平滑处理 3*3卷积\n",
    "        self.smooth = nn.Conv2d(256, 256, 3, 1, 1)\n",
    " \n",
    " \n",
    "    # 构建C2到C5\n",
    "    def _make_layer(self, planes, blocks, stride=1, downsample = None):\n",
    "        # 残差连接前，需保证尺寸及通道数相同\n",
    "        if stride != 1 or self.inplanes != Bottleneck.expansion * planes:\n",
    "            downsample = nn.Sequential(\n",
    "                nn.Conv2d(self.inplanes, Bottleneck.expansion * planes, 1, stride, bias=False),\n",
    "                nn.BatchNorm2d(Bottleneck.expansion * planes)\n",
    "            )\n",
    "        layers = []\n",
    "        layers.append(Bottleneck(self.inplanes, planes, stride, downsample))\n",
    " \n",
    "        # 更新输入输出层\n",
    "        self.inplanes = planes * Bottleneck.expansion\n",
    " \n",
    "        # 根据block数量添加bottleneck的数量\n",
    "        for i in range(1, blocks):\n",
    "            layers.append(Bottleneck(self.inplanes, planes)) # 后面层stride=1\n",
    "        return nn.Sequential(*layers)  # nn.Sequential接收orderdict或者一系列模型，列表需*转化\n",
    " \n",
    "        # 自上而下的上采样\n",
    "    def _upsample_add(self, x, y):\n",
    "        _, _, H, W = y.shape  # b c h w\n",
    "        # 特征x 2倍上采样(上采样到y的尺寸)后与y相加\n",
    "        return F.upsample(x, size=(H, W), mode='bilinear') + y\n",
    " \n",
    "    def forward(self, x):\n",
    "        # 自下而上\n",
    "        c1 = self.relu(self.bn1(self.conv1(x)))   # 1/2\n",
    "        c2 = self.layer1(self.maxpool(c1))      # 1/4\n",
    "        c3 = self.layer2(c2)                    # 1/8\n",
    "        c4 = self.layer3(c3)                    # 1/16\n",
    "        c5 = self.layer4(c4)                    # 1/32\n",
    " \n",
    "        # 自上而下，横向连接\n",
    "        p5 = self.toplayer(c5)\n",
    "        p4 = self._upsample_add(p5, self.latlayer1(c4))\n",
    "        p3 = self._upsample_add(p4, self.latlayer2(c3))\n",
    "        p2 = self._upsample_add(p3, self.latlayer3(c2))\n",
    " \n",
    "        # 平滑处理\n",
    "        p5 = p5  \n",
    "        # p5直接输出\n",
    "        p4 = self.smooth(p4)\n",
    "        p3 = self.smooth(p3)\n",
    "        p2 = self.smooth(p2)\n",
    "        return p2, p3, p4, p5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 13.3.2 RoI Align"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "在 RoI Pooling 层中，为了得到固定尺度大小的候选区域，需要将候选区域的坐标量化，如图 13-5 所示。\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic1.zhimg.com/80/v2-a0b38fba86c86daf1b5e4b2d0c3f2e88_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图 13-5 RoI Pooling过程中存在的量化现象。</div>\n",
    "</center>\n",
    "\n",
    "在上述过程中，将原始图像的候选区域映射到图像的特征图上后，该坐标可能是浮点数。通常情况下会将浮点数坐标量化到最近的整数坐标，但这样会引入误差，导致图像中的像素坐标与候选区域中像素坐标产生偏差，从而严重影响检测算法的性能。为了解决这个问题，RoI Align不再使用量化来处理浮点数坐标，而是直接使用这些浮点数坐标，如图 13-6 所示。\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic3.zhimg.com/80/v2-82c4f767d5bb35de6561d20eac5a21c2_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图 13-6 RoI Align中利用双线性插值法计算浮点数坐标的像素值。</div>\n",
    "\n",
    "</center>\n",
    "那么如何处理浮点数坐标并得到其对应位置的值呢？回忆一下之前学过的双线性插值，对于特征图中的任意一个位置，都可以利用双线性插值的方式得到其对应的数值。在这个过程中不会用到量化操作，不会引入误差，因此原图中的像素和特征图中的数值完全对齐，没有偏差。这不仅提高了检测的精度，还有利于实例分割。在此之后，同RoI Pooling中一样，在RoI Align中该候选区域将被均分成$M\\times N$个块（图中为$2\\times 2$个块，再计算每一个块特征值的均值，最终便可以得到尺度为$M\\times N$的候选区域。下面我们将介绍如何用代码实现RoI Align。\n",
    "\n",
    "\n",
    "<!--\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic1.zhimg.com/80/v2-fd9f97c09bd58792f1e93eb6744eacfc_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图 13-7 使用RoI Align可以使图像中的候选区域与特征图的候选区域坐标对齐。</div>\n",
    "\n",
    "</center>\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# RoI Align Layer\n",
    "class PyramidROIAlign(Layer):\n",
    "    def __init__(self, pool_shape, **kwargs):\n",
    "        super(PyramidROIAlign, self).__init__(**kwargs)\n",
    "        self.pool_shape = tuple(pool_shape)\n",
    " \n",
    "    def call(self, inputs):\n",
    "\n",
    "        # inputs 包含了必要的信息，如检测框的坐标、愿图像信息、特征层维度等\n",
    "        # 获得检测框的坐标\n",
    "        boxes = inputs[0]\n",
    "        \n",
    "        # 获取图像信息\n",
    "        image_meta = inputs[1]\n",
    "        \n",
    "        # 获取特征层信息[batch, height, width, channels]\n",
    "        feature_maps = inputs[2:]\n",
    " \n",
    "        # 获取检测框的宽高\n",
    "        y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)\n",
    "        h = y2 - y1\n",
    "        w = x2 - x1\n",
    " \n",
    "        # 获取图像的大小\n",
    "        image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]\n",
    "        \n",
    "        # 通过检测框的大小找到该检测框属于哪个特征层\n",
    "        image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)\n",
    "        roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))\n",
    "        roi_level = tf.minimum(5, tf.maximum(2, 4 + tf.cast(tf.round(roi_level), tf.int32)))\n",
    "        # 压缩roi_level的axis=2的轴，shape变化[batch, num_boxes,1]->[batch, num_boxes]\n",
    "        roi_level = tf.squeeze(roi_level, 2)\n",
    " \n",
    "        pooled = []\n",
    "        box_to_level = []\n",
    "        # 在网络的2到5层进行截取，其中2到5层分别将原图像尺寸压缩为1/4，1/8，1/16，1/32\n",
    "        for i, level in enumerate(range(2, 6)):\n",
    "            # 提取满足索引的对应boxes\n",
    "            ix = tf.where(tf.equal(roi_level, level))\n",
    "            level_boxes = tf.gather_nd(boxes, ix)\n",
    "            # 指定第i个方框要使用的图像，即指定第i个box对应batch中哪一张图像\n",
    "            box_to_level.append(ix)\n",
    " \n",
    "            # 获得这些检测框所属的图像\n",
    "            box_indices = tf.cast(ix[:, 0], tf.int32)\n",
    "\n",
    "            # 停止梯度下降\n",
    "            level_boxes = tf.stop_gradient(level_boxes)\n",
    "            box_indices = tf.stop_gradient(box_indices)\n",
    "            \n",
    "            ###########################################\n",
    "            # 利用双线性插值法对特征图进行截取  \n",
    "            #   [batch * num_boxes, pool_height, pool_width, channels]\n",
    "            ###########################################\n",
    "            pooled.append(tf.image.crop_and_resize(\n",
    "                feature_maps[i], level_boxes, box_indices, self.pool_shape,\n",
    "                method=\"bilinear\"))\n",
    " \n",
    "        pooled = tf.concat(pooled, axis=0)\n",
    "\n",
    "        # 将顺序和所属的图片进行堆叠\n",
    "        box_to_level = tf.concat(box_to_level, axis=0)\n",
    "        box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)\n",
    "        box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range], axis=1)\n",
    " \n",
    "        # box_to_level[:, 0]表示第几张图\n",
    "        # box_to_level[:, 1]表示第几张图里的第几个框\n",
    "        '''\n",
    "        由于所有经过RPN推荐产生的ROIs都经过refined，\n",
    "        因此其scale没有严格遵守RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)变化，\n",
    "        这会导致pooled_features的顺序与原始boxes的顺序不同，因此需要重新排序。\n",
    "        '''\n",
    "        # 排序原则是首先根据batch排序，再根据box_index排序\n",
    "        # 这里box_to_level[:, 0]为batch轴，×100000是为了加大权重，得到排序tensor\n",
    "        sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]\n",
    "        \n",
    "        ix = tf.nn.top_k(sorting_tensor, k=tf.shape(\n",
    "            box_to_level)[0]).indices[::-1]\n",
    " \n",
    "        # 按顺序获得图片的索引\n",
    "        ix = tf.gather(box_to_level[:, 2], ix)\n",
    "        pooled = tf.gather(pooled, ix)\n",
    " \n",
    "        # 重新reshape为如下\n",
    "        # [batch, num_rois, POOL_SIZE, POOL_SIZE, channels]\n",
    "        shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0)\n",
    "        pooled = tf.reshape(pooled, shape)\n",
    "        return pooled\n",
    " \n",
    "    def compute_output_shape(self, input_shape):\n",
    "        return input_shape[0][:2] + self.pool_shape + (input_shape[2][-1], )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在利用RoI Align层得到了固定空间大小的候选区域之后，便可以利用这些特征图进行后续的任务。同Faster R-CNN一样，Mask R-CNN利用全连接层对这些候选区域进行类别的预测以及坐标回归。除此之外，Mask R-CNN还提出了一条专门用于预测掩膜的分支，如图 13-2 所示。在这一分支中，候选区域首先会经过卷积层，提升特征图的维度，使之后预测掩膜更加准确。后续任务的代码实现如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 类别预测以及候选框回归\n",
    "def fpn_classifier_graph(rois, feature_maps, image_meta,\n",
    "                         pool_size, num_classes, train_bn=True,\n",
    "                         fc_layers_size=1024):\n",
    "    # 首先得到统一尺度空间的特征图\n",
    "    x = PyramidROIAlign([pool_size, pool_size], name=\"roi_align_classifier\")([rois, image_meta] + feature_maps)\n",
    "    # x: [batch, num_rois, POOL_SIZE, POOL_SIZE, channels]\n",
    "    # POOL_SIZE 为7， x的大小为 7*7*256\n",
    "\n",
    "    # 利用卷积进行特征整合\n",
    "    x = TimeDistributed(Conv2D(fc_layers_size, (pool_size, pool_size), padding=\"valid\"),  name=\"mrcnn_class_conv1\")(x)\n",
    "    x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn1')(x, training=train_bn)\n",
    "    x = Activation('relu')(x)\n",
    "    # x: [batch, num_rois, 1, 1, fc_layers_size]\n",
    "    # x: 1*1*1024\n",
    "    \n",
    "    # x: [batch, num_rois, 1, 1, fc_layers_size]\n",
    "    x = TimeDistributed(Conv2D(fc_layers_size, (1, 1)), name=\"mrcnn_class_conv2\")(x)\n",
    "    x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn2')(x, training=train_bn)\n",
    "    x = Activation('relu')(x)\n",
    "    # x: [batch, num_rois, 1, 1, fc_layers_size]\n",
    "    # x: 1*1*1024\n",
    "\n",
    "    shared = Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),  name=\"pool_squeeze\")(x)\n",
    "    # x: [batch, num_rois, fc_layers_size]\n",
    "    # x: 1*1*1024\n",
    "    \n",
    "    # 接着，我们便可以利用x进行后续任务\n",
    "    # Classifier head\n",
    "    # 预测检测框内物体的种类\n",
    "    # mrcnn_probs: [batch, num_rois, num_classes]\n",
    "    mrcnn_class_logits = TimeDistributed(Dense(num_classes), name='mrcnn_class_logits')(shared)\n",
    "    mrcnn_probs = TimeDistributed(Activation(\"softmax\"), name=\"mrcnn_class\")(mrcnn_class_logits)\n",
    " \n",
    "    # BBox head\n",
    "    # 候选框回归\n",
    "    # mrcnn_bbox: [batch, num_rois, num_classes, 4]\n",
    "    x = TimeDistributed(Dense(num_classes * 4, activation='linear'), name='mrcnn_bbox_fc')(shared)\n",
    "    mrcnn_bbox = Reshape((-1, num_classes, 4), name=\"mrcnn_bbox\")(x)\n",
    "    \n",
    "    return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox\n",
    " \n",
    "    \n",
    "#   建立mask模型\n",
    "#   这个模型会利用预测框对特征层进行ROIAlign\n",
    "#   根据截取下来的特征层进行语义分割\n",
    "#----------------------------------------------#\n",
    "def build_fpn_mask_graph(rois, feature_maps, image_meta,\n",
    "                         pool_size, num_classes, train_bn=True):\n",
    "    # 首先得到统一尺度空间的特征图\n",
    "    x = PyramidROIAlign([pool_size, pool_size], name=\"roi_align_mask\")([rois, image_meta] + feature_maps)\n",
    "    # x: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels]\n",
    "    # x: 14*14*256\n",
    "\n",
    "    # 接着，将x输入进4层卷积\n",
    "    x = TimeDistributed(Conv2D(256, (3, 3), padding=\"same\"), name=\"mrcnn_mask_conv1\")(x)\n",
    "    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn1')(x, training=train_bn)\n",
    "    x = Activation('relu')(x)\n",
    " \n",
    "    x = TimeDistributed(Conv2D(256, (3, 3), padding=\"same\"), name=\"mrcnn_mask_conv2\")(x)\n",
    "    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn2')(x, training=train_bn)\n",
    "    x = Activation('relu')(x)\n",
    " \n",
    "    x = TimeDistributed(Conv2D(256, (3, 3), padding=\"same\"), name=\"mrcnn_mask_conv3\")(x)\n",
    "    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn3')(x, training=train_bn)\n",
    "    x = Activation('relu')(x)\n",
    " \n",
    "    x = TimeDistributed(Conv2D(256, (3, 3), padding=\"same\"), name=\"mrcnn_mask_conv4\")(x)\n",
    "    x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn4')(x, training=train_bn)\n",
    "    x = Activation('relu')(x)\n",
    "     \n",
    "    # x: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, 256]\n",
    "    # x: 14*14*256    \n",
    "\n",
    "    # 1层转置卷积进行上采样，将特征层扩大2倍\n",
    "    x = TimeDistributed(Conv2DTranspose(256, (2, 2), strides=2, activation=\"relu\"), name=\"mrcnn_mask_deconv\")(x)\n",
    "    # x: [batch, num_rois, 2xMASK_POOL_SIZE, 2xMASK_POOL_SIZE, 256]\n",
    "    # x: 28*28*256\n",
    "    \n",
    "    # 反卷积后再次进行一个1x1卷积调整通道，使其最终数量为numclasses，代表分的类\n",
    "    # 不难发现上述代码块其实是一个FCN       \n",
    "    \n",
    "    # x: [batch, num_rois, 2xMASK_POOL_SIZE, 2xMASK_POOL_SIZE, numclasses]\n",
    "    x = TimeDistributed(Conv2D(num_classes, (1, 1), strides=1, activation=\"sigmoid\"), name=\"mrcnn_mask\")(x)\n",
    "    return x\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在获得了对该候选区域预测的相关信息之后，我们便可以进行损失函数的计算了。对于该候选区域，Mask R-CNN的损失函数$L$定义如下：\n",
    "\n",
    "$$L = L_{cls}+L_{reg}+L_{mask}$$\n",
    "\n",
    "通过上式我们可以发现损失包含了三个部分：分类损失$L_{cls}$、坐标回归损失$L_{reg}$和掩膜分割损失$L_{mask}$。其中，分类损失与坐标回归损失的定义同Fast R-CNN中的一致，而最后一项的掩膜分割损失，用于优化目标物体的二进制掩膜。它使用像素级的二值交叉熵损失来衡量预测掩膜与真实掩膜之间的差距。\n",
    "\n",
    "在测试阶段，Mask R-CNN的整体步骤和训练相似。将图像经过预处理后利用RPN生成一系列候选区域，再利用RoI Align将其尺度进行统一。然后对这些候选区域进行类别的预测以及预测框的生成，同时预测其二进制掩膜。最后，利用NMS消除冗余的候选区域，即可得到最终的结果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.4 Mask R-CNN代码实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们先导入Mask R-CNN所需要的包。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! git clone https://github.com/matterport/Mask_RCNN.git\n",
    "! cd Mask_RCNN\n",
    "\n",
    "import os\n",
    "import sys\n",
    "sys.path.append('Mask_RCNN')\n",
    "os.chdir('./Mask_RCNN')\n",
    "os.getcwd()\n",
    "\n",
    "! python setup.py install\n",
    "\n",
    "\n",
    "from mrcnn.config import Config\n",
    "from mrcnn import model as modellib\n",
    "from mrcnn import visualize\n",
    "import cv2\n",
    "import colorsys\n",
    "import argparse\n",
    "import imutils\n",
    "import random\n",
    "import os\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接着，我们导入模型。这里，模型使用MS COCO数据集进行预训练。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SimpleConfig(Config):\n",
    "    # give the configuration a recognizable name\n",
    "    NAME = \"coco_inference\"\n",
    "    # set the number of GPUs to use along with the number of images\n",
    "    # per GPU\n",
    "    GPU_COUNT = 1\n",
    "    IMAGES_PER_GPU = 1\n",
    "    # number of classes on COCO dataset\n",
    "    NUM_CLASSES = 81\n",
    "    \n",
    "    \n",
    "config = SimpleConfig()\n",
    "config.display()\n",
    "model = modellib.MaskRCNN(mode=\"inference\", config=config, model_dir=os.getcwd())\n",
    "model.load_weights(\"mask_rcnn_coco.h5\", by_name=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，我们便可以使用Mask R-CNN进行实例分割。我们先导入一张图像，并使用Mask R-CNN对其进行实例分割。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image = cv2.imread('./images/3132016470_c27baa00e8_z.jpg')\n",
    "image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)\n",
    "image = imutils.resize(image, width=512)\n",
    "# perform a forward pass of the network to obtain the results\n",
    "print(\"[INFO] making predictions with Mask R-CNN...\")\n",
    "result = model.detect([image], verbose=1)\n",
    "\n",
    "r1 = result[0]\n",
    "visualize.display_instances(image, r1['rois'], r1['masks'], r1['class_ids'], class_names, r1['scores'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以观察到，每一个独立的物体都被捕捉到，并使用一个掩码去描述该物体。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.5 小结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这一章中，我们学习了实例分割的基础知识，并动手实现了一个实例分割的代表性模型——Mask R-CNN。Mask R-CNN是基于R-CNN家族模型的一个集大成者，在多个领域中的表现都非常出色。在之后的章节中，我们将继续探索计算机视觉的另一个领域——人体姿势估计。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13.6 习题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. RoI Align层和 RoI Pooling层的区别在哪里？相比较之下它有什么优点？\n",
    "2. 请动手为Mask R-CNN更改骨干网络，比较不同骨干网络对模型效果的影响。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参考文献"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross B. Girshick:\n",
    "Mask R-CNN. ICCV 2017: 2980-2988\n",
    "\n",
    "[2] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, Serge J. Belongie:\n",
    "Feature Pyramid Networks for Object Detection. CVPR 2017: 936-944"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
